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Abstract 

In this paper, we introduce a new methodology for Bayesian variable selection in linear regression 
that is independent of the traditional indicator method. A diagonal matrix G is introduced to the 
prior of the coefficient vector (3, with each of the gj's, bounded between 0 and 1, on the diagonal 
serves as a stabilizer of the corresponding 13j. Mathematically, a promising variable has a gj value 
that is close to 0, whereas the value of gj corresponding to an unpromising variable is close to 1. This 
property is proven in this paper under orthogonality together with other asymptotic properties. 
Computationally, the sample path of each gj is obtained through Metropolis-within-Gibbs sampling 
method. Also, in this paper we give two simulations to verify the capability of this methodology 
in variable selection. 

Keywords: multiple linear regression; Bayesian variable selection; ^r-prior 


1 Introduction 


Consider the traditional multiple linear regression (MLR) model having the form 


y = X/3 e, e ~ AA(0, cj^I) 


( 1 . 1 ) 


where y is an n x 1 response vector, X an n x p data matrix, R a p x 1 coefficient vector and e 
the random error. In quite a few areas where the linear model applies, an intreresting yet very 
important fact is that only a small portion of variables affect the response whereas others are 


trivial (Jeffreys and Berger, 1991). A great many authors have discussed this topic from both the 


frequentist (for example, Ullah and Wang (2013)) and the Bayesian perspective (Walli and Wagn^ 


2011). In this paper, we proceed following the Bayesian path. 


In the Bayesian setting (see, for example, Miller (2002) for detail), the coefficient (3 is usually 


given a conventional < 7 -prior AA(0, ga‘^(X.^'K) ^), introduced in Zellner (1986). The g-prior has been 


















given much attention in Bayesian variable selection primarily because it leads to a computationally 
tractable Bayes Factor. By introducing an indicator vector, variables are selected and different 
subsets of variables are compared to each other, or to a reference, based on the value of Bayes 


Factor. Multiple works have been done to review this methodology. For a recent one, see Dey and 


Fokoue (2015). 


In detail, a random indicator vector 7 = (71,72,- •' ■,lpY' is injected to Equation (1.1), such 
that for each 7 ^, j = 1 , 2 ,... ,p, we have 


7i 


= 


1 if Xj appears in the model, 
0 otherwise. 


( 1 . 2 ) 


Thus, for each combination of Jj’s, Equation (1.1) is modified to 


y = X^/3^ + e, £ ~ AA(0, cr^I), 


where Xj is the subset of variables according to 7 and (3^ is the corresponding coefficient vector. 
There is a total of 2^ combinations of 7 , including the full model, 7 = 1 = (1,1,..., 1)^, and the 
null model, 7 = 0 = (0,0,..., 0)^. Eor each combination of 7 j’s, a corresponding density p{y \ 7 ) 
and the Bayes Eactor 


BF^i — 


v{y I 7) 


p{y I 1)’ 

A difficulty quickly arises when the dimensionality increases, due to the fact that this method 
searches through the model space of size 2"^. Certain works have been done to solve this problem. 


George and McCulloch (1993) proposed an empirical method of stochastic search variable selection 


(SSVS). Each (3j is selected or rejected based on a Monte Carlo average of 7 ^, coming from a Gibbs- 
sampler. Such Monte Carlo average of 'jj is called the posterior inclusion probability (PIP) of /3j. 


Similar work can be seen in Barbieri and Berger (2004), in which the authors proposed a median 


probability model rather than a highest probability model, and the variables are selected based on 


a criterion of PIPj > 0.5. Further, Fokoue (2007) modified the method in Barbieri and Berger 


(2004) to a prevalence model, which solved the problem that such median probability model may 


not exist. Certain works have been doen to summarize the Bayesian variable selection with the 
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indicator method. O’Hara and Sillanpaa (2011) provides a thorough review of different methods in 


Bayesian variable selection. Han and Carlin (2001) gives a comparison in detail of different empirical 


Bayes methods, especially the Markov Chain Monte Carlo (MCMC) methods, regarding the Bayes 
Factor. 

Certain thoughts have been given to the prior of f3 instead of the traditional g'-prior. [George 


and McCulloch (1997) provides a prior of f3^ that follows 


(3^ ~ M (0, D-),R-j,D.yy, 


(1.3) 


where D-y is a diagonal matrix and R-., is symmetric. Such prior gives a good generalization of 


g'-prior. Agliari and Parisetti (1988) gives an alternative that follows 






(1.4) 


where A-y is symmetric and weights different observations, but not the features. Also, see Fernandez 


et al. (2001) for a very detailed comparison of different prior choices for Bayesian variable selection. 


Moreover, multiple works have been done to extend the original Zellner’s sf-prior. Specifically, jLian^ 


et al. (2008) proposed a study on mixtures of priors which provides a family of hyperpriors on g 


while still preserves the tractability on the marginal likelihood. Bove and Held (2011) developed 


an extension of the classical Zellner’s i^-prior to generalized linear models, given a large family 


of hyperpriors on g. Maruyama and George (2011) introduced a fully Bayes formulation with an 


orthogonal decomposition on the matrix X^X-y, which resolves the issue of p > n. All the works 
mentioned above rely on the indicator method, which is classic but somewhat redundant. To its 
worst, the methods still have to face the model space of size 2^. In this work, we intend to get rid 
of this indicator method completely. 


On the other hand. Tipping (2001) introduced a method called the relevance vector machine 


(RVM) from the machine learning perspective that performs nonparametric variable selection. 
Retaining the traditional Gaussian prior on /3, with a little modification, each of the /3j’s follows a 


Gaussian prior (0, a- ) independently. The parameter aj serves a purpose as the stabilizer. That 
is, since the coefficient R- is a priori centered at 0, the prior variance become 0 as a,- —)• oo, and. 
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on the contrary, the prior of /3j becomes flat as aj —)> 0. Interestingly, as stated in Tipping (2004), 


combining the non-sparse Gaussian prior on /3 with a Gamma hyperprior on each of the a^’s, the 
marginal of (3 in fact becomes a multivariate t-distribution after integrating out the a^-’s, which 
leads the RVM to a sparse selection machine. This property of sparsity is even more elegant when 
the input in the linear model is raised from feature space to kernel space, which is the main focus 


m 


Tipping (2001 2004), but not in our work. 


Our work somewhat combines the methodology in George and McCulloch| ( |1997[ ) and Tipping 
(2004), but gets rid of the traditional indicator method completely. Section provides a thorough 
theoretical analysis on this new method, including the formulation, some important derivation, and 
some asymptotic properties. We introduce the computation of model htting in|^ Here we apply 
the method of Metropolis-within-Gibbs. In Section 4, we verify the ability of variable selection of 
this new methodology with two examples. Finally, we provide a summary in Section 5. 


2 The Av-G Formulation 


2.1 The hierarchical model for variable selection 


Given an MLR model with form (1.1), we inject a prior to the coefficient (3 having the form 
Mp{(3 I 0,kct 2(GX^XG)-1), where, in the variance of the prior, k > 0 controls the total scale of 
the variance, and G = diag{gi, g 2 , ■ ■ ■, gp) controls how “relevant” each dimension is, with each 
gj G (0,1) having an impact to the variance of the corresponding jdj. This is to some extent a 


combination between George and McCulloch (1997) and Tipping (2001). In comparison to George 


and McCulloch (1997), the diagonal matrix D in (1.3) is the matrix G ^ here, and R is the matrix 


(X^X) The essential difference is that we have discarded the indicator 7 . Also, in comparison 


to Tipping (2001), this prior can be seen as a parametric analogy to the prior given in RVM. 


Further, each of the ^j’s is assigned an i.i.d. Beta{a,b) prior, and by conjugacy k an inverse- 
gamma prior IG{a,6). We keep the setting in Zellner (1986) for cr^, that is, a Jeffreys’ prior 
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p(cr^) oc (u^) And thus, the formulation of the hierarchical model follows; 


y |/3,a^ ~ AAn (y |X/3,cr^l) 

/3|G,K,a2 ~ AAp (/3|0 ,kc72(GX'^XG)-i) 

p 

Cjr n Beta {gj \a,b) 
i=i 

K ~ IG{a, 9) 
p{a^) oc 


Directly following (2.1), the joint posterior is given by 


( 2 . 1 ) 


P (/3, n, G, cT^ |y) ~ p{y |/3, )p{f3 | G, k, ct^ )p{K)p{G)p{a‘^) 

~ (y |X/3, cT^l) X Afp {(3 |0, Atu2(GX^XG)-^) 

p 

X IG{a, X Beta {gj \a,b) x (cr^)“^ 

1=1 

~ |(T^l| ^'^%xp|-^(y-X/3)'^(y-X/3)| X 


KCT^fGX^XG) ^ 


- 1/2 


exp 


1 


2 kct2 


/3^GX^XG/3 


X K " ^ exp (-) X 


Har'dxM 


21-1 


cl=i 


( 2 . 2 ) 


From (2.2), it is of specific interest to examine the posterior of /3 and G. The former gives some 


intuition of the connection between this formulation and both the ordinary least square (OLS) 
estimation and the original Zellner’s (/-prior, whereas the latter is crucial in the understanding of 
variable selection with this model. 
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2.2 Posterior of (3 


Following (2.2), the posterior of (3 is given by 


p {P |K,G,o-^y|) ~ (y “ (y ~ ^P) 


exp 


P^GX^XGp 


2k(j‘^ 

~ AAp , 

where jlfj and are the posterior mean and variance and take one the form of 


(2.3) 


/l/3= X"X+-GX"XG XV 




E;3 = cjV X^X + -GX^ XG 


(2.4) 


From (2.3) and (2.4), we have the following asymptotic results. 


Lemma 2.1. Denote by P 
'(OLS)\ 


(OLS) 


the OLS estimator of p. For any k / 0, as G —>• 0, V —>• /3 


and —)• Var (^P ^ . 

Proof. The proof is rather straightforward. Given k ^ 0 and gj —)• 0, \/j, 


(OLS) 


and 


~ ^ 'Z^OLS) 

M/3 (X X) X y = p 


- f^iOLS) 


Eei^aUX^X) = Var 


( 3 ‘ 


Lemma 2.2. For any k ^ 0, as G —I, we have 


K ^{OLS) 
^ ’ 


□ 


which is the same as the posterior mean of P in Zellner’s g-prior. 
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Proof. Given k ^ 0 and gj —>■ 1, Vj 


/^/3 


X^X + -X^X ) X" y 


(X^X)-'x^y 


K+1 


K ^(OLS) 


^ “ 1 “ 1 


□ 


Lemma 2.1 states that given G approaches a null matrix, the posterior mean of (3 approaches 
the OLS estimator of (3. Also notice that G —)• 0 is equivalent to assigning a flat prior to /3, since 
the prior would have infinite variance. Thus it would lead to a posterior that is equivalent to OLS. 


Lemma 2.2 states that in the case where G approaches an identity matrix, the posterior mean of 
(3 converges to the case in the original Zellner’s 5 -prior, with the parameter k in this formulation 
being the same as the original parameter g. This result gives an intuition that the k-G formulation 
is indeed a generalization of Zellner’s 5 -prior. Also, it is of interest that as k —)• 00 the convergence 


from 5/3 to (3 


[OLS) 


does not require a specific matrix G. 


2.3 Posterior of G 


We then derive the posterior of G given y, k and by integrating out (3. 

p{G\y,a‘^,K) = [ p{y\p,a^)p{p\G)p{G)dp 
J0 


]j 5 eta( 5 j |a, 6 ) / Wn (y |X/3, cj^l) x Wp (/3 |0, kcj 2(GX^XG)"^) d/3 


i=i 


oc |G|“|Ip-G| 


X"X+ -GX^ XG 

K 


- 1/2 


(2.5) 


exp J ^y^X (^X^X + ^GX^XG ) X^y 


-1 


Unfortunately, the expression in (2.5) does not have a closed form. However, we could see 


, -1 


that the posterior properties of G relies much on the matrix (X^X -|- ^GX^XG) And yet we 
cannot proceed the analysis of posterior properties of G in the most general cases since this inverse 
matrix does not have a further expression in which the matrix G can be isolated. Figure gives 
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an intuition of the posterior of G in the case where p = 2. Without loss of generality, we assume 


xi is a promising variable while X 2 is not. In such case, we have x^^y = 0 and |x^y| ^ 0. Notice 
from the figure that the posterior of G is maximized roughly at 51 —t- 0 and g 2 ^ 1. This is crucial 
in linking the ac-G formulation and variable selection. Intuitively, we would expect a promising 
variable to have a corresponding gj close to 0 while an unpromising variable to have a gj close to 
1 . 



0.0 0.2 0.4 0.6 0.8 1.0 

91 



Figure I: Perspective and contour plot of p (G |y, cr^, k) 


2.4 A case of orthogonality 

As stated above, much of the posterior properties rely on (X^X + ^GX^XG) ^. Though at 
this point we are not able to proceed to the analysis of the most general case, the analysis under 


orthogonality where X^X = diag{'x.Jj(.j) is rather tangible. In this case, the posterior of G in (2.5) 


is simplified to 


P (G |y, k) oc i^gj (1 - gj)'' ^ {k + g]) 
j=i 

2 


2^-l/2 


exp < 


«(xjy 


2cr2xJxj (k + s-? 



( 2 . 6 ) 






Based on (2.6), the joint posterior density of G can be written as the product of the marginal 


posterior density functions of each gj's, which implies that the gj's are a posteriori independent 
under orthogonality. This simplifies the analysis of p(G|-) by analysing each individual posterior 
density p(( 7 jj-) with 


Pi9j\-)oc g] (1 - gj)’" ^ {k + g'j) 


- 1/2 


exp < 


Mxjy 


2a‘^xjxj U + g] 


(2.7) 


As was mentioned in the introduction, a crucial question with this formulation is: “how is the 
K — G methodology linked together with variable selection?” Such question can be seen in two 
ways. First, we answer how the promising variables lead to certain posterior properties of gj’s. 
And second, we answer why such properties of gj's indicate certain variables are promising and 
others are not. 


Theorem 2.1. A promising variable xj has a corresponding gj that is close to 0, whereas an 
unpromising variable has a corresponding gj that is close to 1. 

Proof. Without loss of generality, assume a = b = ^ and k = = 1. Given the posterior density 

p {gj\y, , nj such that 

P {9j\y, k) 9^^ (1 - 9j)~^^‘^ (l + 9j) exp 
= 9 ]'^^ (1 - (1 + 51) ^^^exp 



where 9j is the angle between xj and y, the general idea of the proof is that we find the gj that 
maximizes the posterior likelihood, i.e. the maximum a posteriori estimate for the two cases where 


xjy = 0 and xjy / 0. 

Unpromising variable. For an unpromising variable xj, it is reasonable to assume that cos 9j = 0. 


Therefore in (2.8) exp(-) = 1 and we are left with 


P {9j\y, 1 ^) « g^^ (1 - gj) (1 + gj) , 
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/ \-l /2 

which is an increasing function of gj on (0,1), as il + g'jj is monotone decreasing from 1 to 
and gy^ (1 — gj)~^^‘^ is monotone increasing and gy^ (1 — gj)~^^‘^ —>■ +oo as gj —)• 1. Therefore 
in the case where the variable xj is unpromising we have 


gj = argmax p [gj\y, k) = I . 
9j 


(2.9) 


Promising variable. For a promising variable xj, it is reasonable to assume that cos9j ~ 1. 


Since all the terms on the exponent in (2.8) are positive, exp(-) is a decreasing function of gj on 
(0,1). Further, although the value of exp(-) somewhat depends on ||y||, the exponential function 
dominates the whole posterior likelihood with even a moderate value of ||y||. Therefore we have 


gj = argniax p {gj\y, cr^, k) 

{ IlylP cos^ 6j 
2(1+ 5j) 

= 0 +. 


( 2 . 10 ) 


And thus concludes the proof of the theorem. 


□ 


Further, Corollary 2.1 provides a very useful result under orthogonality. 

Corollary 2.1. Under orthogonality, the posterior mean of under the k — G formulation, 

fif^, is an unbiased estimator of p. 

Proof. Denote pj as the posterior mean of the jth variable based on the k — G formulation. Under 


orthogonality, that is, X^X = diag{x^ Xj), the posterior mean of P in (2.4) is simplified to 




^ ( T 


K + gj 


J K + gj ^ 


As was shown above, we have gj —)■ 0 for a promising variable. Therefore in this case 


Pj 


K + 0 


T \-i T oiOLS) 
(x^x,) x^.y = /3] b 


Since is an unbiased estimator of Pj, pj is also unbiased. 
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On the other hand, if gj —)• 1, indicating the variable Xj does not belong to the trne model 
and = 0, the qnantity shonld capture the unpromising feature and converges to 0 itself. 

Therefore the bias also vanishes in this case. □ 


3 Aspects of Computation 

3.1 Conditional density of and K 

We then introduce the conditional distribution of k and which mostly serve for the compu¬ 


tational purpose. From (2.2), we obtain a closed-form expression of the conditional density of the 
scale parameter k, 

p(/^|/3,cJ^G,y) ~/G(d,0), (3.1) 

where 


~ - P > 

a — - -ho 

0 = ^ (/3 - /3of GX^XG (/3 - /3o) + 0. 


Likewise, the conditional density of ci^ also has a closed-form expression given by 


p {a^\(3,K, G,y) ^IG 


n + p 


y + ^ (/3 - 3)^X^X - 3) + ^ (/3 - GX^XG (/3 - /Iq)] , 


where 


(3.2) 


s2 = (y-X3)^(y-X3) 

3 = = (X^X)^^ X^3 


3.2 A usefnl sampling algorithm 

In this k-G formulation, there are four sets of parameters to be estimated from the data. 
Namely, (3 and G, each consisting of p individual parameters, and n and The MCMC method 
is very useful in this case to obtain the sample path of the parameters, and specifically, the Gibbs- 
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sampler is a very convenient tool. However, Gibbs-sampler does require the conditional or posterior 
density of the parameters to be known, or of closed-form. As we have addressed before, the exact 
form of the posterior of G is unknown. Fortunately, the Gibbs sampling of G can be replaced by 
a Metropolis step, which only requires the density to be known to a proportion. For each draw of 
G, the acceptance ratio is 


p(gW |y,c72,K) /J(G(*) |G(*-i)) 

p (Gh-i) k) j j (Gh-i) |g(*)) 


where p(G|-) is given by (2.5) and J(-) is the proposal distribution and is dehned as 




Beta 

i=i 


.(*) 


(3.3) 


Here we assume that the gj's within each draw are independent. The shape and scale parameters 
in Beta K may differ in various cases. As any typical Metropolis-Hastings algorithms, G^*^ 
is accepted as G^*^ with probability min(l,r). Thus, the whole Metropolis-within-Gibbs algorithm 
is given in Algorithmic 


Algorithm 1: The k-G formulation for Bayesian variable selection 

Input: data matrix X, response y, initial values (3^^\ and G^*^^ 

1 for t = 1 to T do 

Update (3^*^ based on p y^ as in (2.3); 

Update based on p y ^ as in ( |3.1[ ); 

Update (cj^)W based on p y^ as in (3.2); 

Accept G*^*) = G*^*) with probability min{l,r) as in (3.3); 


6 end 


Notice that the sampling order, that is, which parameters are updated first each time, is mostly 
arbitrary. We choose to update G last merely because it involves a Metropolis step, which is more 
complex than the Gibbs steps. 

In terms of varaible selection, we would expect the sample path of gj^s of a promising variable 
to be severely skewed to the right within in the support of (0,1), and vice versa. Or in terms of 
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the posterior mean of gj^ given by 


v^T it) 

= r ' . (34) 

a promising variable would have a that is close to 0 , and an unpromising variable close to 

1 . 


4 Numerical Examples and Discussion 

4.1 Simulations 

In this section we demonstrate our methodology with two simulated examples. First, consider 
again when p = 2. xi and X 2 both have 30 observations and come from an i.i.d. V( 0 , 1 ), and the 
true model is given by 

Vi = 2xii + V(0,1). 


Here xi is assumed to be the promising variable. Using Algorithm 1, we set the parameters as 
a = b = 0.5, a = 9 = 1, and T = 100000. In the Metropolis step, we use an independent uniform 
proposal distribution 




P Beta 
i=i 




1,1 


Figure provides a histogram of the sample path of gj's in the simulation. It is not surprising 
that gi is severely skewed to the right and concentrates toward 0 , which corresponds to xi being 
promising, whereas g 2 is severely skewed to the left and concentrates toward 1, corresponding to X 2 
being unpromising. Tableprovides a numerical summary of the gj’s. Due to its severe skewness, 
here we provide both the mean, denoted by gj, and the median, denoted by gj. The numerical 


Table 1: Numerical Summary oi gj, p = 2 


Variable 

ii 

Si 

Xl 

.0682 

.0427 

X 2 

.6986 

.8227 


summary of gj for each of the two variable reflects the theoretical deduction in Section 2. 
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Figure 2: Histogram oi gj, p = 2 

The second example extends the dimensionality mildly to p = 10. Still, all the predictors are 
i.i.d. from Af{0, 1). The true model is given by 


Hi = 2{xii + Xi2 + Xis) + A/'(0,1). 


The set-up of the algorithm is mostly the same as in the previous example, except that the prior 
parameters of gj are a = b = 0.3, instead of 0.5. In this case, the “U” shape of the Beta prior is 
more strict than before. Also we have T = 10000 in this case. Figure provides a comparison of 
the sample path of the gj's. Again, we have gi, g 2 , and gs close to 0, which corresponds to the 
associated predictors in the true model. 

4.2 Discussion 

In Section we introduced how this formulation is motivated by the posterior inclusion prob¬ 
ability (PIP) and the relevance vector machine (RVM). Here we discuss these connections in detail 
using the simulations above. 
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I-1-1-1-1-1 I-^-1-1-1-1 I-^-1-1-1-1 I-^-1-1-1-1 I-^-1-1-1-1 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 


g.draws #1 g.draws #2 g.draws #3 g.draws #4 g.draws #5 



0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 


g.draws #6 g.draws#? g.draws #8 g.draws #9 g.draws #10 

Figure 3: Histogram of gj, p = 10 

As stated before, the value gj or gj of a promising variable is close to 0, so that the value of 
1 — or 1 — gj is close to 1. We can see to this quantity 1 — gj as an analogy to the PIP. However, 
since the procedure of computing PIP searches the space jj G {0,1}, whereas the computation of 
gj searches the space gj G (0,1), though both quantities are the average of their sample path, quite 
often the PIP equals to 1 for a promising variable while the value of gj or gj can hardly be 0. 

Table [^summarizes the quantities gj, gj, 1 — gj, 1 — and the corresponding PIP in the second 
simulation. For the promising variable xi, X2, and xg, gj's are roughly .01 while the PIPs equal 


Table 2: Numerical Summary of gj, p = 10 



ii 

1 -ii 

Si 

1 - Si 

PIP 


ii 

1 -ii 

i-ii 

Sj 

PIP 

Xl 

.0160 

.9840 

.0130 

.9870 

1.0000 

X6 

.4517 

.5483 

.3630 

.6370 

.0475 

X2 

.0150 

.9850 

.0124 

.9876 

1.0000 

X7 

.5510 

.4490 

.5364 

.4636 

.0403 

X3 

.6960 

.3040 

.7952 

.2048 

.0386 

X8 

.0196 

.9804 

.0157 

.9843 

1.0000 

X4 

.4086 

.6914 

.3186 

.6814 

.0775 

Xg 

.6762 

.3238 

.7713 

.2287 

.0366 

X5 

.6749 

.3251 

.7652 

.2348 

.0366 

xio 

.6352 

.3648 

.6992 

.3008 

.0606 
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to 1, and for the unpromising variables, gj’s are far from 0 while the PIPs are small. Also, it is of 
interest to notice that the promising variables selected by the two methods are identical although 
the two methodologies are of different origins. 


We then consider the connection between the k-G formulation and the relevance vector machine. 
One major similarity between the two is the role of the hyper-parameters. Both gj^s in this paper 


and the a^-’s in Tipping (2001) appear in the prior variance of (3. In fact, both gj and aj serves as 
the “stabilizer”. That is, given a Gaussian prior centered at 0, a large value of gj or aj yields a 
high prior precision, or low prior variance of f3j, so that the prior of /3j is essentially 0. However, 
unlike Tipping (2001), in which the prior variance of f5j is solely gj is only part of the 


variance, so that it is not necessary to set gj G (0, oo), but only a bounded domain between 0 and 1 
is sufficient. Also, in terms of sparsity, the k-G is designed as a sparse machine, that is, we would 
expect that only a few variables affect the response by assigning a “U-shaped” Beta hyperprior to 
the parameter gj. 

It is also of interest to verify Corollary 2.1, which indicates, under orthogonality, the unbiased- 


-(Sa^es) 


'(Bayes) 


\{OLS) 


in 


ness of j3 under this formulation. Table provides a comparison of (3 and (3 

the second simulation. Given the true values as Pi = j32 = Ps = 2, with 10000 iterations, the 


Table 3: Comparison of and 



k-G 

OLS 


k-G 

OLS 

A 

1.9901 

1.9939 


-.0306 

-.0405 

P2 

1.9836 

1.9863 

P7 

-.0116 

-.0192 

h 

.0089 

.0190 

h 

1.9517 

1.9545 

Pa 

.0543 

.0709 

P9 

-.0071 

-.0124 

h 

.0022 

.0023 

PlO 

.0323 

.0551 


estimates from the methodology of this paper are very close to the OLS estimates. 


5 Conclusion 

In this paper we have demonstrated a new methodology for Bayesian variable selection in linear 
model that is completely independent to the traditional indicator variable method. The coefficient 
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vector f3 is given a Gaussian prior with the form J\fp (O, fi;cr^(GX^XG)“^). By injecting a diagonal 
matrix G to the variance of the prior, each gj on the diagonal serves as a variance stabilizer such 
that the promising variables are selected based on the gj^s that are close to 0. Mathematically, 
under orthogonality, the gj's are independent and the posterior of each single gj is maximized in the 
support (0,1) at gj —)• 0 if the corresponding variable is promising, and vice versa. Further, the 
estimator of (3 under orthogonality is asymptotically unbiased. Computationally, the hierarchical 
model is fitted using the Metropolis-within-Gibbs sampling method. 

In Section we have demonstrated through two simulations the usefulness of this methodology 
under orthogonality. Though the dimensionalities in each simulation, p = 2 and p = 10 respectively, 
are very mild, the results have shown that this formulation is capable of variable selection and 
parameter estimation, both with considerable accuracy. The systematic or theoretical examination 
outside orthogonality is still remained undone, in which the main difficulty involves the inverse 
matrix (X^X + ^GX^XG) ^. In conclusion, as it is completely independent of searching through 
the 2^ model space, this methodology has the potential of selecting variables with higher efficiency 
comparing to the traditional methodology and merits further interest and investigation. 
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